Abstract

Many learning machines that have hierarchical structure or hidden variables are now being
used in information science, artificial intelligence, and bioinformatics. However, several learning
machines used in such fields are not regular but singular statistical models, hence their
generalization performance is still left unknown. To overcome these problems, in the previous papers, we
proved new equations in statistical learning, by which we can estimate the Bayes generalization
loss from the Bayes training loss and the functional variance, on the condition that the true
distribution is a singularity contained in a learning machine. In this paper, we prove that the same
equations hold even if a true distribution is not contained in a parametric model. Also we prove
that the proposed equations in a regular case are asymptotically equivalent to the Takeuchi
information criterion. Therefore, the proposed equations are always applicable without any condition
on the unknown true distribution.

1 Introduction

Nowadays, a lot of learning machines are being used in information science, artificial
intelligence, and bioinformatics. However, several learning machines used in such fields,
for example, three-layer neural networks, hidden Markov models, normal mixtures,
binomial mixtures, Boltzmann machines, and reduced rank regressions have hierarchical
structure or hidden variables, with the result that the mapping from the parameter to 
the probability distribution is not one-to-one. In such learning machines, it was pointed 
out that the maximum likelihood estimator is not subject to the normal distribution 
[51 m El [2], and that the a posteriori distribution can not be approximated by any 
Gaussian distribution [11], [13], [14], [15]. Hence the conventional statistical methods for
model selection, hypothesis test, and hyperparameter optimization are not applicable 
to such learning machines. In other words, we have not yet established the theoret- 
ical foundation for learning machines which extract hidden structures from random 
samples. 

In statistical learning theory, we study the problem of learning and generalization
based on several assumptions. Let $q(x)$ be a true probability density function and
$p(x|w)$ be a learning machine, which is represented by a probability density function
of $x$ for a parameter $w$. In this paper, we examine the following two assumptions.

(1) The first is the parametrizability condition. A true distribution $q(x)$ is said to be
parametrizable by a learning machine $p(x|w)$, if there is a parameter $w_0$ which satisfies
$q(x) = p(x|w_0)$. Otherwise, it is called nonparametrizable.

(2) The second is the regularity condition. A true distribution $q(x)$ is said to be regular
for a learning machine $p(x|w)$, if the parameter $w_0$ that minimizes the log loss function

$$L(w) = -\int q(x) \log p(x|w)\,dx \tag{1}$$

is unique and if the Hessian matrix $\nabla^2 L(w_0)$ is positive definite. If a true distribution
is not regular for a learning machine, then it is said to be singular.

In the study of layered neural networks and normal mixtures, both conditions are
important. In fact, if a learning machine is redundant compared to a true distribution, then
the true distribution is parametrizable and singular. Or if a learning machine is too
simple to approximate a true distribution, then the true distribution is
nonparametrizable and regular. In practical applications, we need a method to determine the optimal
learning machine. Therefore, a general formula is desirable by which the generalization
loss can be estimated from the training loss without regard to such conditions.

In the previous papers [18], [19], [20], [21], [22], we studied a case when a true distribution
is parametrizable and singular, and proved new formulas which enable us to estimate 
the generalization loss from the training loss and the functional variance. Since the 
new formulas hold for an arbitrary set of a true distribution, a learning machine, and 
an a priori distribution, they are called equations of states in statistical estimation. 
However, it has not been clarified whether they hold or not in a nonparametrizable 
case. 

In this paper, we study the case when a true distribution is nonparametrizable and 
regular, and prove that the same equations of states also hold. Moreover, we show that, 
in a nonparametrizable and regular case, the equations of states are asymptotically 
equivalent to the Takeuchi information criterion (TIC) for the maximum likelihood 
method. Here TIC was derived for the model selection criterion in the case when the 
true distribution is not contained in a statistical model [10]. The network information
criterion ^ was devised by generalizing it to an arbitrary loss function in the regular 
case. 

If a true distribution is singular for a learning machine, TIC is ill-defined, whereas 
the equations of states are well-defined and equal to the average generalization losses. 
Therefore, equations of states can be understood as the generalized version of TIC 
from the maximum likelihood method in a regular case to Bayes method for regular 
and singular cases. 

This paper consists of seven sections. In Section 2, we summarize the framework of
Bayes learning and the results of previous papers. In Section 3, we show the main
results of this paper. In Section 4, some lemmas are prepared which are used in the
proofs of the main results. The proofs of lemmas are given in the Appendix. In Section
5, we prove the main theorems. In Sections 6 and 7, we discuss and conclude this paper.

2 Background 

In this section, we summarize the background of the paper. 
2.1 Bayes learning 

Firstly we introduce the framework of Bayes and Gibbs estimations, which is well 
known in statistics and learning theory. 

Let $N$ be a natural number and $\mathbb{R}^N$ be the $N$-dimensional Euclidean space. Assume
that an information source is given by a probability density function $q(x)$ on $\mathbb{R}^N$
and that random samples $X_1, X_2, \ldots, X_n$ are independently subject to the probability
distribution $q(x)dx$. Sometimes $X_1, X_2, \ldots, X_n$ are said to be training samples and the
information source $q(x)$ is called a true probability density function. In this paper we
use notations for a given function $g(x)$,

$$E_X[g(X)] = \int g(x) q(x)\,dx,$$

$$E_j^{(n)}[g(X_j)] = \frac{1}{n} \sum_{j=1}^{n} g(X_j).$$

Note that the expectation $E_X[g(X)]$ is given by the integration by the true distribution,
but that the empirical expectation $E_j^{(n)}[g(X_j)]$ can be calculated using random samples.

We study a learning machine $p(x|w)$ of $x \in \mathbb{R}^N$ for a given parameter $w \in \mathbb{R}^d$.
Let $\varphi(w)$ be an a priori probability density function on $\mathbb{R}^d$. The expectation operator
$E_w[\ ]$ by the a posteriori probability distribution with the inverse temperature $\beta > 0$
for a given function $g(w)$ is defined by

$$E_w[g(w)] = \frac{1}{Z(\beta)} \int g(w)\, \varphi(w) \prod_{i=1}^{n} p(X_i|w)^{\beta}\, dw,$$

where $Z(\beta)$ is the normalizing constant. The Bayes generalization loss $B_g$, the Bayes
training loss $B_t$, the Gibbs generalization loss $G_g$, and the Gibbs training loss $G_t$ are
respectively defined by

$$B_g = -E_X[\log E_w[p(X|w)]],$$

$$B_t = -E_j^{(n)}[\log E_w[p(X_j|w)]],$$

$$G_g = -E_X[E_w[\log p(X|w)]],$$

$$G_t = -E_j^{(n)}[E_w[\log p(X_j|w)]].$$
The functional variance $V$ is defined by

$$V = n \times E_j^{(n)}\{ E_w[(\log p(X_j|w))^2] - E_w[\log p(X_j|w)]^2 \}.$$

The concept of the functional variance was first proposed in the previous papers.
In this paper, we show that the functional variance plays an important role in learning
theory. Remark that $B_g$, $B_t$, $G_g$, $G_t$, and $V$ are random variables because $E_w[\ ]$
depends on random samples. Let $E[\ ]$ denote the expectation value over all sets of
training samples. Then $E[B_g]$ and $E[B_t]$ are respectively called the average Bayes
generalization and training errors, and $E[G_g]$ and $E[G_t]$ the average Gibbs ones.
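
As an illustration of these definitions, the following minimal sketch computes the two training losses and the functional variance from a matrix of log-likelihood values. It assumes (this is not part of the paper) that parameters $w_1, \ldots, w_K$ approximately distributed according to the a posteriori distribution are already available, so that $E_w[\ ]$ can be replaced by an average over them. The function name and the array layout are hypothetical.

```python
import numpy as np

def training_losses_and_variance(loglik):
    """Estimate B_t, G_t and V from loglik[k, j] = log p(X_j | w_k).

    Assumption: the K rows correspond to parameters w_k drawn (approximately)
    from the a posteriori distribution, so E_w[.] becomes a mean over axis 0.
    """
    loglik = np.asarray(loglik, dtype=float)
    n = loglik.shape[1]
    # log E_w[p(X_j|w)], shifted by the column maximum for numerical stability
    shift = loglik.max(axis=0)
    log_pred = shift + np.log(np.exp(loglik - shift).mean(axis=0))
    B_t = -log_pred.mean()                 # Bayes training loss
    G_t = -loglik.mean(axis=0).mean()      # Gibbs training loss
    # V = n * E_j^{(n)}{ E_w[(log p)^2] - E_w[log p]^2 }
    V = n * loglik.var(axis=0).mean()
    return B_t, G_t, V

# Toy input: K = 2 parameter draws, n = 2 samples.
loglik = np.log(np.array([[0.5, 0.25],
                          [0.5, 0.75]]))
B_t, G_t, V = training_losses_and_variance(loglik)
```

By Jensen's inequality, the Gibbs training loss computed this way is never smaller than the Bayes training loss.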

In theoretical analysis, we assume some conditions on a true distribution and a
learning machine. If there exists a parameter $w_0$ such that $q(x) = p(x|w_0)$, then the
true distribution is said to be parametrizable. Otherwise, it is said to be nonparametrizable. In
both cases, we define $w_0$ as the parameter that minimizes the log loss function $L(w)$
in eq.(1). Note that $w_0$ is equal to the parameter that minimizes the Kullback-Leibler
distance from the true distribution to the parametric model. If $w_0$ is unique and if the
Hessian matrix

$$\frac{\partial^2}{\partial w_j \partial w_k} L(w_0)$$

is positive definite, then the true distribution is said to be regular for a learning
machine.

Remark. Several learning machines such as a layered neural network or a normal
mixture have natural nonidentifiability by the symmetry of a parameter. For example,
in a normal mixture,

$$p(x|a,b,c) = \frac{a}{\sqrt{2\pi}} \exp\left(-\frac{(x-b)^2}{2}\right) + \frac{1-a}{\sqrt{2\pi}} \exp\left(-\frac{(x-c)^2}{2}\right),$$

two probability distributions $p(x|a,b,c)$ and $p(x|1-a,c,b)$ give the same probability
distribution, hence the parameter $w_0$ that minimizes $L(w)$ is not unique for any true
distribution. In a parametrizable and singular case, such nonidentifiability strongly
affects learning [111 IIS]- However, in a nonparametrizable and regular case, the a 
posteriori distribution in the neighborhood of each optimal parameter has the same
form, so that we can assume $w_0$ is unique without loss of generality.
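
The symmetry in this remark can be checked numerically. The sketch below assumes a two-component normal mixture with unit variances, mixing ratio $a$ and means $b$ and $c$; the function name and the chosen parameter values are hypothetical.

```python
import numpy as np

def normal_mixture_pdf(x, a, b, c):
    """Two-component mixture a*N(b, 1) + (1 - a)*N(c, 1) (assumed unit variances)."""
    x = np.asarray(x, dtype=float)
    coef = 1.0 / np.sqrt(2.0 * np.pi)
    return (a * coef * np.exp(-0.5 * (x - b) ** 2)
            + (1.0 - a) * coef * np.exp(-0.5 * (x - c) ** 2))

xs = np.linspace(-5.0, 5.0, 201)
p1 = normal_mixture_pdf(xs, 0.3, -1.0, 2.0)   # parameter (a, b, c)
p2 = normal_mixture_pdf(xs, 0.7, 2.0, -1.0)   # parameter (1 - a, c, b)
same_density = np.allclose(p1, p2)            # the two parameters are not distinguishable
```

Swapping only the means without also replacing $a$ by $1-a$ gives a different density, so the nonidentifiability is exactly the exchange of the two components.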

2.2 Notations 

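Secondly, we explain some notations.

For given scalar functions $f(w)$ and $g(w)$, the vector $\nabla f(w)$ and two matrices
$\nabla f(w) \nabla g(w)$ and $\nabla^2 f(w)$ are respectively defined by

$$(\nabla f(w))_j = \frac{\partial f(w)}{\partial w_j},$$

$$(\nabla f(w) \nabla g(w))_{jk} = \frac{\partial f(w)}{\partial w_j} \frac{\partial g(w)}{\partial w_k},$$

$$(\nabla^2 f(w))_{jk} = \frac{\partial^2 f(w)}{\partial w_j \partial w_k}.$$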

Let $n$ be the number of training samples. For a given constant $\alpha$, we use the following
notations.

(1) $Y_n = O_p(n^{\alpha})$ shows that a random variable $Y_n$ satisfies $|Y_n| \le C n^{\alpha}$ with some
random variable $C > 0$.

(2) $Y_n = o_p(n^{\alpha})$ shows that a random variable $Y_n$ satisfies the convergence in probability
$|Y_n| / n^{\alpha} \to 0$.

(3) $y_n = O(n^{\alpha})$ shows that a sequence $y_n$ satisfies $|y_n| \le C n^{\alpha}$ with some constant
$C > 0$.

(4) $y_n = o(n^{\alpha})$ shows that a sequence $y_n$ satisfies the convergence $|y_n| / n^{\alpha} \to 0$.

Remark. For a sequence of random variables, a mathematically technical procedure
is needed to prove convergence in probability or convergence in law. If we adopt the
completely mathematical procedure in the proof, a lot of readers in information
science may not find the essential points in the theorems. For example, see [18], [21], [22].
Therefore, in this paper, we adopt the natural and appropriate level of mathematical
rigor, from the viewpoint of mathematical sciences. The notations $O_p$ and $o_p$
are very useful and understandable for such a purpose.

2.3 Parametrizable and singular case 

Thirdly, we introduce the results of the previous researches [18], [19], [20], [21]. We do not
prove these results in this paper.

Assume that a true distribution is parametrizable. Even if the true distribution is
singular for a learning machine,

$$E[B_g] = S_0 + \frac{\lambda_0 - \nu_0}{n\beta} + \frac{\nu_0}{n} + o\left(\frac{1}{n}\right), \tag{2}$$

$$E[B_t] = S_0 + \frac{\lambda_0 - \nu_0}{n\beta} - \frac{\nu_0}{n} + o\left(\frac{1}{n}\right), \tag{3}$$

$$E[G_g] = S_0 + \frac{\lambda_0}{n\beta} + \frac{\nu_0}{n} + o\left(\frac{1}{n}\right), \tag{4}$$

$$E[G_t] = S_0 + \frac{\lambda_0}{n\beta} - \frac{\nu_0}{n} + o\left(\frac{1}{n}\right), \tag{5}$$

$$E[V] = \frac{2\nu_0}{\beta} + o(1), \tag{6}$$

where $S_0$ is the entropy of the true probability density function $q(x)$,

$$S_0 = -\int q(x) \log q(x)\,dx.$$

The constants $\lambda_0$ and $\nu_0$ are respectively the generalized log canonical threshold and the
singular fluctuation, which are birational invariants. Their concrete values can be
derived by using an algebraic geometrical transformation called resolution of singularities.
By eliminating $\lambda_0$ and $\nu_0$ from eq.(2)-eq.(6),

$$E[B_g] = E[B_t] + (\beta/n) E[V] + o\left(\frac{1}{n}\right), \tag{7}$$

$$E[G_g] = E[G_t] + (\beta/n) E[V] + o\left(\frac{1}{n}\right), \tag{8}$$

hold, which are called equations of states in learning, because these relations hold for
an arbitrary set of a true distribution, a learning machine, and an a priori distribution.
By this relation, we can estimate the generalization loss using the training loss and
the functional variance. However, it has been left unknown whether the equations of
states, eq.(7) and eq.(8), hold or not in nonparametrizable cases.
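
The elimination step can be written out as a short worked check. The forms of the three expansions used below are stated as an assumption in the comments; they are taken to parallel eqs.(14), (15), and (18) of Theorem 1 with the constants $S_0$, $\lambda_0$, $\nu_0$.

```latex
% Assumed forms of the expansions for the Bayes losses and the functional variance:
%   E[B_g] = S_0 + (\lambda_0 - \nu_0)/(n\beta) + \nu_0/n + o(1/n)
%   E[B_t] = S_0 + (\lambda_0 - \nu_0)/(n\beta) - \nu_0/n + o(1/n)
%   E[V]   = 2\nu_0/\beta + o(1)
\begin{align*}
E[B_g] - E[B_t] &= \frac{2\nu_0}{n} + o\!\left(\frac{1}{n}\right), \\
\frac{\beta}{n}\, E[V] &= \frac{2\nu_0}{n} + o\!\left(\frac{1}{n}\right), \\
\text{hence}\quad E[B_g] &= E[B_t] + \frac{\beta}{n}\, E[V] + o\!\left(\frac{1}{n}\right).
\end{align*}
```

The same subtraction applied to the Gibbs losses gives the second equation of states, since $\lambda_0$ and $S_0$ cancel in both differences.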

3 Main Results 

In this section, we describe the main results of this paper. The proofs of theorems are 
given in Section 5. 

3.1 Equations of states 

In this paper, we study the case when a true distribution is nonparametrizable and regular.
Let $w_0$ be the unique parameter that minimizes $L(w)$. Three constants $S$, $\lambda$, and $\nu$
are respectively defined by

$$S = L(w_0), \tag{9}$$

$$\lambda = \frac{d}{2}, \tag{10}$$

$$\nu = \frac{1}{2} \mathrm{tr}(I J^{-1}), \tag{11}$$

where $d$ is the dimension of the parameter, and $I$ and $J$ are $d \times d$ matrices defined by

$$I = \int \nabla \log p(x|w_0) \nabla \log p(x|w_0)\, q(x)\,dx, \tag{12}$$

$$J = -\int \nabla^2 \log p(x|w_0)\, q(x)\,dx. \tag{13}$$

Theorem 1 Assume that a true distribution $q(x)$ is nonparametrizable and regular for
a learning machine $p(x|w)$. Then

$$E[B_g] = S + \frac{\lambda - \nu}{n\beta} + \frac{\nu}{n} + o\left(\frac{1}{n}\right), \tag{14}$$

$$E[B_t] = S + \frac{\lambda - \nu}{n\beta} - \frac{\nu}{n} + o\left(\frac{1}{n}\right), \tag{15}$$

$$E[G_g] = S + \frac{\lambda}{n\beta} + \frac{\nu}{n} + o\left(\frac{1}{n}\right), \tag{16}$$

$$E[G_t] = S + \frac{\lambda}{n\beta} - \frac{\nu}{n} + o\left(\frac{1}{n}\right), \tag{17}$$

$$E[V] = \frac{2\nu}{\beta} + o(1). \tag{18}$$

Therefore, equations of states hold,

$$E[B_g] = E[B_t] + (\beta/n) E[V] + o\left(\frac{1}{n}\right), \tag{19}$$

$$E[G_g] = E[G_t] + (\beta/n) E[V] + o\left(\frac{1}{n}\right). \tag{20}$$

Proof of this theorem is given in Section 5. Note that constants are different between
the parametrizable and nonparametrizable cases, that is to say, $S \neq S_0$, $\lambda \neq \lambda_0$, and
$\nu \neq \nu_0$. However, the same equations of states still hold. In fact, eq.(19) and eq.(20)
are identical to eq.(7) and eq.(8), respectively.

By combining the results of the previous papers with the new result in Theorem
1, it is ensured that the equations of states are applicable to an arbitrary set of a true
distribution, a learning machine, and an a priori distribution, without regard to the
condition on the unknown true distribution.

Remark. If a true distribution is parametrizable and regular, then $I = J$, hence
$\lambda = \nu = d/2$. Otherwise, $I \neq J$ in general. Note that $J$ is positive definite by the
assumption, but that $I$ may not be positive definite in general.
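
The parametrizable and regular case of this remark can be written out as a worked special case. The identity matrix symbol $E_d$ is introduced here only for illustration; the expansions used are those stated in Theorem 1.

```latex
% Worked special case: parametrizable and regular, so I = J.
\begin{align*}
\nu &= \tfrac{1}{2}\,\mathrm{tr}(I J^{-1}) = \tfrac{1}{2}\,\mathrm{tr}(E_d) = \tfrac{d}{2} = \lambda, \\
E[B_g] - E[B_t] &= \frac{2\nu}{n} + o\!\left(\frac{1}{n}\right) = \frac{d}{n} + o\!\left(\frac{1}{n}\right), \\
E[V] &= \frac{2\nu}{\beta} + o(1) = \frac{d}{\beta} + o(1).
\end{align*}
```

In this case the correction between the generalization and training losses depends only on the number of parameters $d$, which is the AIC-type correction.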

3.2 Comparison of TIC with equations of states

If the maximum likelihood method is employed, or equivalently if $\beta = \infty$, then $B_g$
and $B_t$ are respectively equal to the generalization and training losses of the maximum
likelihood method. It was proved in [10] that

$$E[B_g] = E[B_t] + \frac{TIC}{n} + o\left(\frac{1}{n}\right) \quad (\beta = \infty), \tag{21}$$

where

$$TIC = \mathrm{tr}(I(w_0) J(w_0)^{-1}).$$

On the other hand, the equations of states, eq.(19) in Theorem 1, show that

$$E[B_g] = E[B_t] + \frac{E[\beta V]}{n} + o\left(\frac{1}{n}\right) \quad (0 < \beta < \infty). \tag{22}$$

Therefore, in this subsection, let us compare $\beta V$ with TIC in the nonparametrizable
and regular case.

Let $L_n(w)$ be the empirical log loss function

$$L_n(w) = -E_j^{(n)}[\log p(X_j|w)] - \frac{1}{n\beta} \log \varphi(w).$$

Three matrices are defined by

$$I_n(w) = E_j^{(n)}[\nabla \log p(X_j|w) \nabla \log p(X_j|w)], \tag{23}$$

$$J_n(w) = -E_j^{(n)}[\nabla^2 \log p(X_j|w)], \tag{24}$$

$$K_n(w) = \nabla^2 L_n(w). \tag{25}$$

In practical applications, instead of TIC, the empirical TIC is employed,

$$TIC_n = \mathrm{tr}(I_n(\hat{w}_{MLE}) J_n(\hat{w}_{MLE})^{-1}),$$

where $\hat{w}_{MLE}$ is the maximum likelihood estimator. Then by using the convergence in
probability $\hat{w}_{MLE} \to w_0$,

$$E[TIC_n] = TIC + o(1).$$

On the other hand, we have shown in Theorem 1 that

$$E[\beta V] = TIC + o(1).$$

Hence let us compare $\beta V$ with $TIC_n$ as random variables.

Theorem 2 Assume that $q(x)$ is nonparametrizable and regular for a learning machine
$p(x|w)$. Then

$$TIC_n = TIC + O_p\left(\frac{1}{\sqrt{n}}\right),$$

$$\beta V = TIC + O_p\left(\frac{1}{\sqrt{n}}\right),$$

$$\beta V = TIC_n + O_p\left(\frac{1}{n}\right).$$

Proof of this theorem is given in Section 5. Theorem 2 shows that the difference
between $\beta V$ and $TIC_n$ is of a smaller order, $O_p(1/n)$, than the fluctuation of each
of them around TIC, $O_p(1/\sqrt{n})$. Therefore,
if a true distribution is nonparametrizable and regular for a learning machine, then
the equations of states are asymptotically equivalent to the empirical TIC. If a true
distribution is singular or if the number of training samples is not so large, then the
empirical TIC and the equations of states are not equivalent, in general. Hence the
equations of states are applicable more widely than TIC. Experimental analysis for the
equations of states was reported in [18], [19], [20]. The main purpose of this paper is to
prove Theorems 1 and 2. Its application to practical problems is a topic for future
study.
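
The empirical TIC of this subsection can be sketched in a few lines. The function below takes per-sample gradients and Hessians of $\log p(X_j|w)$ evaluated at the maximum likelihood estimator; its name, the array layout, and the one-parameter normal model used in the example are assumptions made for illustration only.

```python
import numpy as np

def empirical_tic(scores, hessians):
    """TIC_n = tr(I_n J_n^{-1}) from per-sample derivatives at the MLE.

    scores[j]   : gradient of log p(X_j|w) at w_MLE, shape (n, d)
    hessians[j] : Hessian of log p(X_j|w) at w_MLE,  shape (n, d, d)
    """
    scores = np.asarray(scores, dtype=float)
    hessians = np.asarray(hessians, dtype=float)
    n = scores.shape[0]
    I_n = scores.T @ scores / n          # empirical I_n, eq.(23)
    J_n = -hessians.mean(axis=0)         # empirical J_n, eq.(24)
    # tr(I_n J_n^{-1}) = tr(J_n^{-1} I_n); solve avoids forming the inverse
    return np.trace(np.linalg.solve(J_n, I_n))

# Hypothetical example: p(x|w) = N(w, 1), so the score is x - w and the Hessian is -1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w_mle = x.mean()
tic_n = empirical_tic((x - w_mle)[:, None], -np.ones((x.size, 1, 1)))
```

For this model $TIC_n$ reduces to the sample variance of the data, so it differs from the parameter count $d = 1$ unless the data have unit variance.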
4 Preparation of Proof

In this section, we summarize the basic properties which are used in the proofs of main
theorems.

4.1 Maximum a posteriori estimator

Firstly, we study the asymptotic behavior of the maximum a posteriori estimator. By
the definition, for each $w$,

$$K_n(w) = J_n(w) + O\left(\frac{1}{n}\right).$$

By the central limit theorem, for each $w$,

$$I_n(w) = I(w) + O_p\left(\frac{1}{\sqrt{n}}\right), \tag{26}$$

$$J_n(w) = J(w) + O_p\left(\frac{1}{\sqrt{n}}\right), \tag{27}$$

$$K_n(w) = J(w) + O_p\left(\frac{1}{\sqrt{n}}\right). \tag{28}$$

The parameter that minimizes $L_n(w)$ is denoted by $\hat{w}$, which is called the maximum a
posteriori estimator (MAP). If $\beta = 1$, then it is equal to the conventional maximum a
posteriori estimator. If $\beta = \infty$, or equivalently $1/\beta = 0$, then it is the maximum
likelihood estimator (MLE), which is denoted by $\hat{w}_{MLE}$.

Let us summarize the basic properties of the maximum a posteriori estimator.
Because $w_0$ and $\hat{w}$ minimize $L(w)$ and $L_n(w)$ respectively,

$$\nabla L(w_0) = 0, \tag{29}$$

$$\nabla L_n(\hat{w}) = 0. \tag{30}$$

Since, by the assumption, $w_0$ is unique and the matrix $J$ is positive definite, the consistency
of $\hat{w}$ holds under the natural condition, in other words, the convergence in probability
$\hat{w} \to w_0$ $(n \to \infty)$ holds for $0 < \beta \le \infty$. In this paper, we assume such a consistency
condition.

From eq.(30), there exists $w^*_\beta$ which satisfies

$$\nabla L_n(w_0) + \nabla^2 L_n(w^*_\beta)(\hat{w} - w_0) = 0 \tag{31}$$

and

$$|w^*_\beta - w_0| \le |\hat{w} - w_0|,$$

where $|\cdot|$ denotes the norm of $\mathbb{R}^d$. By using the definition $K_n(w^*_\beta) = \nabla^2 L_n(w^*_\beta)$,

$$\hat{w} - w_0 = -K_n(w^*_\beta)^{-1} \nabla L_n(w_0). \tag{32}$$

By using the law of large numbers and the central limit theorem, $K_n(w^*_\beta)$ converges
to $J$ in probability and $\sqrt{n}\, \nabla L_n(w_0)$ converges in law to the normal distribution with
average 0 and covariance matrix $I$. Therefore

$$\sqrt{n}\,(\hat{w} - w_0)$$

converges in law to the normal distribution with average 0 and covariance matrix
$J^{-1} I J^{-1}$, resulting that

$$E[(\hat{w} - w_0)(\hat{w} - w_0)^T] = \frac{J^{-1} I J^{-1}}{n} + o\left(\frac{1}{n}\right), \tag{33}$$

for $0 < \beta \le \infty$, where $(\ )^T$ denotes the transposed vector. In other words,

$$\hat{w} = w_0 + O_p\left(\frac{1}{\sqrt{n}}\right). \tag{34}$$


Hence, 



By using eq. flS^ . 

Since $\nabla L_n(w_0) = O_p(1/\sqrt{n})$ and $J(w_0)$ is positive definite, we have

$$\hat{w}_{MLE} = \hat{w} + O_p\left(\frac{1}{n}\right). \tag{35}$$
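
The leading-order covariance $J^{-1} I J^{-1}/n$ of the estimator in eq.(33) can be evaluated numerically once $I$ and $J$ are known or estimated. The following is a minimal sketch; the function name and the example matrices are hypothetical.

```python
import numpy as np

def sandwich_covariance(I, J, n):
    """Leading-order covariance J^{-1} I J^{-1} / n of the estimator, as in eq.(33)."""
    I = np.atleast_2d(np.asarray(I, dtype=float))
    J = np.atleast_2d(np.asarray(J, dtype=float))
    J_inv = np.linalg.inv(J)
    return J_inv @ I @ J_inv / n

# Hypothetical 2-parameter example with I != J, as in a nonparametrizable case.
I = np.array([[2.0, 0.0], [0.0, 1.0]])
J = np.array([[1.0, 0.0], [0.0, 2.0]])
cov = sandwich_covariance(I, J, n=100)
```

When $I = J$, the result reduces to $J^{-1}/n$, the usual inverse Fisher information form.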

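4.2 Expectations by a posteriori distribution

Secondly, the behavior of the a posteriori distribution is described as follows.

For a given function $g(w)$, the average by the a posteriori distribution is defined by

$$E_w[g(w)] = \frac{\int g(w) \exp(-n\beta L_n(w))\,dw}{\int \exp(-n\beta L_n(w))\,dw}.$$

Then we can prove the following relations.

Lemma 1

$$E_w[(w - \hat{w})] = O_p\left(\frac{1}{n}\right), \tag{36}$$

$$E_w[(w - \hat{w})(w - \hat{w})^T] = \frac{K_n^{-1}(\hat{w})}{n\beta} + O_p\left(\frac{1}{n^2}\right), \tag{37}$$

$$E_w[(w - \hat{w})_i (w - \hat{w})_j (w - \hat{w})_k] = O_p\left(\frac{1}{n^2}\right), \tag{38}$$

$$E_w[|w - \hat{w}|^m] = O_p\left(\frac{1}{n^{m/2}}\right) \quad (m \ge 1). \tag{39}$$

Moreover,

$$E E_w[(w - w_0)(w - w_0)^T] = \frac{J^{-1}}{n\beta} + \frac{J^{-1} I J^{-1}}{n} + o\left(\frac{1}{n}\right), \tag{40}$$

$$E E_w[|w - w_0|^3] = o\left(\frac{1}{n}\right). \tag{41}$$

For the proof of this lemma, see Appendix.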

Let us introduce a log density ratio function $f(x,w)$ by

$$f(x,w) = \log \frac{p(x|w_0)}{p(x|w)}.$$

Then $f(x,w_0) = 0$ and

$$\nabla f(x,w) = -\nabla \log p(x|w),$$

$$\nabla^2 f(x,w) = -\nabla^2 \log p(x|w).$$

In the proof of Theorem 1, we need the following six expectation values,

$$D_1 = E E_X[E_w[f(X,w)]],$$

$$D_2 = (1/2) E E_X[E_w[f(X,w)^2]],$$

$$D_3 = (1/2) E E_X[E_w[f(X,w)]^2],$$

$$D_4 = E E_j^{(n)}[E_w[f(X_j,w)]],$$

$$D_5 = (1/2) E E_j^{(n)}[E_w[f(X_j,w)^2]],$$

$$D_6 = (1/2) E E_j^{(n)}[E_w[f(X_j,w)]^2].$$

The constant $\mu$ is defined by

$$\mu = \frac{1}{2} \mathrm{tr}(I J^{-1} I J^{-1}). \tag{42}$$
Then we can prove the following relations. 

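Lemma 2 Let $\nu$ and $\mu$ be constants which are respectively defined by eq.(11) and
eq.(42). Then

$$D_1 = \frac{d}{2n\beta} + \frac{\nu}{n} + o\left(\frac{1}{n}\right),$$

$$D_2 = \frac{\nu}{n\beta} + \frac{\mu}{n} + o\left(\frac{1}{n}\right),$$

$$D_3 = \frac{\mu}{n} + o\left(\frac{1}{n}\right),$$

$$D_4 = \frac{d}{2n\beta} - \frac{\nu}{n} + o\left(\frac{1}{n}\right),$$

$$D_5 = \frac{\nu}{n\beta} + \frac{\mu}{n} + o\left(\frac{1}{n}\right),$$

$$D_6 = \frac{\mu}{n} + o\left(\frac{1}{n}\right).$$

For the proof of this lemma, see Appendix.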
5 Proofs

In this section, we prove theorems.

5.1 Proof of Theorem 1

Firstly, by using the definitions

$$S = L(w_0) = -E_X[\log p(X|w_0)],$$

$$p(x|w) = p(x|w_0) \exp(-f(x,w)),$$

the Bayes generalization loss is given by

$$\begin{aligned}
E[B_g] &= -E E_X[\log E_w[p(X|w)]] \\
&= S - E E_X[\log E_w[\exp(-f(X,w))]] \\
&= S - E E_X\left[\log E_w\left(1 - f(X,w) + \frac{f(X,w)^2}{2}\right)\right] + o\left(\frac{1}{n}\right) \\
&= S + E E_X E_w[f(X,w)] - \frac{1}{2} E E_X E_w[f(X,w)^2] + \frac{1}{2} E E_X[E_w[f(X,w)]^2] + o\left(\frac{1}{n}\right) \\
&= S + D_1 - D_2 + D_3 + o\left(\frac{1}{n}\right) \\
&= S + \frac{d}{2n\beta} - \frac{\nu}{n\beta} + \frac{\nu}{n} + o\left(\frac{1}{n}\right).
\end{aligned}$$

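Secondly, the Bayes training loss is

$$\begin{aligned}
E[B_t] &= -E E_j^{(n)}[\log E_w[p(X_j|w)]] \\
&= S - E E_j^{(n)}[\log E_w[\exp(-f(X_j,w))]] \\
&= S - E E_j^{(n)}\left[\log E_w\left(1 - f(X_j,w) + \frac{f(X_j,w)^2}{2}\right)\right] + o\left(\frac{1}{n}\right) \\
&= S + E E_j^{(n)} E_w[f(X_j,w)] - \frac{1}{2} E E_j^{(n)} E_w[f(X_j,w)^2] + \frac{1}{2} E E_j^{(n)}[E_w[f(X_j,w)]^2] + o\left(\frac{1}{n}\right) \\
&= S + D_4 - D_5 + D_6 + o\left(\frac{1}{n}\right) \\
&= S + \frac{d}{2n\beta} - \frac{\nu}{n\beta} - \frac{\nu}{n} + o\left(\frac{1}{n}\right).
\end{aligned}$$

Thirdly, the Gibbs generalization loss is

$$\begin{aligned}
E[G_g] &= -E E_X E_w[\log p(X|w)] \\
&= S + E E_X E_w[f(X,w)] \\
&= S + D_1 \\
&= S + \frac{d}{2n\beta} + \frac{\nu}{n} + o\left(\frac{1}{n}\right).
\end{aligned}$$
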

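Fourthly, the Gibbs training loss is

$$\begin{aligned}
E[G_t] &= -E E_j^{(n)} E_w[\log p(X_j|w)] \\
&= S + E E_j^{(n)} E_w[f(X_j,w)] \\
&= S + D_4 \\
&= S + \frac{d}{2n\beta} - \frac{\nu}{n} + o\left(\frac{1}{n}\right).
\end{aligned}$$

Lastly, the functional variance is given by

$$\begin{aligned}
E[V] &= 2n(D_5 - D_6) \\
&= 2n(D_2 - D_3) + o(1) \\
&= \frac{2\nu}{\beta} + o(1).
\end{aligned}$$

Therefore, we obtained Theorem 1.
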
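5.2 Proof of Theorem 2

Let $V_w[f(X,w)]$ be the variance of $f(X,w)$ in the a posteriori distribution,

$$V_w[f(X,w)] = E_w[f(X,w)^2] - E_w[f(X,w)]^2.$$

Then

$$V_w[f(X,w)] = V_w[f(X,w) - f(X,\hat{w})]$$

holds because $f(X,\hat{w})$ is a constant function of $w$. By the Taylor expansion at $w = \hat{w}$,

$$f(X,w) - f(X,\hat{w}) = \nabla f(X,\hat{w}) \cdot (w - \hat{w}) + \frac{1}{2}(w - \hat{w}) \cdot \nabla^2 f(X,\hat{w})(w - \hat{w}) + O(|w - \hat{w}|^3).$$

Using this expansion, and eq.(36), eq.(37), eq.(38), and eq.(39),

$$V_w[f(X,w)] = V_w[\nabla f(X,\hat{w}) \cdot (w - \hat{w})] + O_p\left(\frac{1}{n^2}\right).$$

Hence

$$\begin{aligned}
\beta V &= n\beta\, E_j^{(n)}[V_w[f(X_j,w)]] \\
&= n\beta\, E_j^{(n)}\{E_w[(\nabla f(X_j,\hat{w}) \cdot (w - \hat{w}))^2] - E_w[\nabla f(X_j,\hat{w}) \cdot (w - \hat{w})]^2\} + O_p\left(\frac{1}{n}\right).
\end{aligned}$$

The second term is $O_p(1/n)$ by eq.(36). Therefore, by applying eq.(37) to the first
term,

$$\begin{aligned}
\beta V &= n\beta\, \mathrm{tr}\left( E_j^{(n)}[(\nabla f(X_j,\hat{w}))(\nabla f(X_j,\hat{w}))^T]\, E_w[(w - \hat{w})(w - \hat{w})^T] \right) + O_p\left(\frac{1}{n}\right) \\
&= \mathrm{tr}(I_n(\hat{w}) K_n^{-1}(\hat{w})) + O_p\left(\frac{1}{n}\right).
\end{aligned}$$

Therefore, by using eq.(35), the proof of Theorem 2 is completed.
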
6 Discussion 

Let us discuss the results of this paper from the three different points of view. 

Firstly, we discuss how to calculate the equations of states numerically.
The widely applicable information criterion (WAIC) [TH [22] is defined by 

$$WAIC = -\sum_{i=1}^{n} \log E_w[p(X_i|w)] + \beta \sum_{i=1}^{n} \{ E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2 \}.$$

Then by Theorem 1,

$$E[WAIC] = E[nB_g] + o(1)$$

holds. Hence by minimization of WAIC, we can optimize the model and the
hyperparameter for the minimum Bayes generalization loss. In Bayes estimation, a set of
parameters $\{w_k\}$ is prepared so that it approximates the a posteriori distribution.
Sometimes it is done by the Markov chain Monte Carlo method, and we can
approximate the average by the a posteriori distribution by

$$E_w[g(w)] \approx \frac{1}{K} \sum_{k=1}^{K} g(w_k).$$

Therefore the WAIC can be calculated numerically using such a set $\{w_k\}$.
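
The numerical procedure described above can be sketched end to end for a toy model. The example assumes the one-parameter normal model $p(x|w) = N(w, 1)$ with a normal a priori distribution $N(0, \sigma^2)$, for which the a posteriori distribution with inverse temperature $\beta$ is again normal and can be sampled exactly, so no Markov chain Monte Carlo is needed. The function names, the prior variance, and the data are hypothetical choices, not taken from the paper.

```python
import numpy as np

def posterior_draws(x, beta=1.0, prior_var=100.0, K=2000, seed=0):
    """Exact draws from the a posteriori distribution of w for p(x|w) = N(w, 1)
    with prior N(0, prior_var) and inverse temperature beta (conjugate case)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    precision = 1.0 / prior_var + beta * x.size
    mean = beta * x.sum() / precision
    return rng.normal(mean, np.sqrt(1.0 / precision), size=K)

def waic(x, draws, beta=1.0):
    """WAIC = -sum_i log E_w[p(X_i|w)] + beta * sum_i Var_w[log p(X_i|w)]."""
    x = np.asarray(x, dtype=float)
    # loglik[k, i] = log p(X_i | w_k) for the unit-variance normal model
    loglik = -0.5 * (x[None, :] - draws[:, None]) ** 2 - 0.5 * np.log(2.0 * np.pi)
    shift = loglik.max(axis=0)
    log_pred = shift + np.log(np.exp(loglik - shift).mean(axis=0))
    functional_var = loglik.var(axis=0)   # E_w[(log p)^2] - E_w[log p]^2
    return -log_pred.sum() + beta * functional_var.sum()

x = np.array([-1.0, 0.0, 0.5, 2.0])
draws = posterior_draws(x, beta=1.0)
value = waic(x, draws, beta=1.0)
```

The draws must come from the a posteriori distribution with the same $\beta$ that is passed to `waic`; dividing the returned value by $n$ gives the corresponding estimate of the Bayes generalization loss.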

Secondly, we study the fluctuation of the Bayes generalization error. In Theorem 1,
we proved that, as the number of training samples tends to infinity, two expectation
values converge to the same value,

$$E[n(B_g - B_t)] \to \mathrm{tr}(IJ^{-1}),$$

$$E[\beta V] \to \mathrm{tr}(IJ^{-1}).$$

Moreover, in Theorem 2, we proved the convergence in probability,

$$\beta V \to \mathrm{tr}(IJ^{-1}).$$

On the other hand, in the same way as Theorem 1, we can prove

$$n(B_g - B_t) = n \times \mathrm{tr}(I(\hat{w} - w_0)(\hat{w} - w_0)^T) + o_p(1).$$

Since $\sqrt{n}(\hat{w} - w_0)$ converges in law to the Gaussian random variable whose average
is zero and variance is $J^{-1}IJ^{-1}$, the random variable $n(B_g - B_t)$ converges not to a
constant in probability but to a random variable in law. In other words, the relation
between expectation values

$$E[B_g] = E[B_t] + \frac{E[\beta V]}{n} + o\left(\frac{1}{n}\right) \tag{43}$$

holds, whereas they are not equal to each other as random variables,

$$B_g \neq B_t + \frac{\beta V}{n} + o_p\left(\frac{1}{n}\right). \tag{44}$$

Note that, even if the true distribution is parametrizable and regular, the
generalization and training losses have the same properties, therefore both AIC and TIC have the same
properties as eq.(43) and eq.(44).

Lastly, let us compare the generalization loss by the Bayes estimation with that by
the maximum likelihood estimation. In a regular and parametrizable case, they are
equal to each other asymptotically. In a parametrizable and singular case, the Bayes
generalization error is smaller than that of the maximum likelihood method. Let us
compare them in a nonparametrizable and regular case. By Theorem 1,

$$E[B_g] = S + \frac{\mathrm{tr}(IJ^{-1})}{2n} + \frac{d - \mathrm{tr}(IJ^{-1})}{2n\beta} + o\left(\frac{1}{n}\right).$$

When $\beta = \infty$, this is the generalization error of the maximum likelihood method. If
$d > \mathrm{tr}(IJ^{-1})$, then $E[B_g]$ is an increasing function of $1/\beta$. Or if $d < \mathrm{tr}(IJ^{-1})$, then
$E[B_g]$ is a decreasing function of $1/\beta$. If $I < J$, then $\mathrm{tr}(IJ^{-1}) < d$. By the definition
of $I$ and $J$,

$$I = \int \nabla p(x|w_0) \nabla p(x|w_0)^T \frac{q(x)}{p(x|w_0)^2}\,dx$$

and

$$J = I - Q,$$

where

$$Q = \int (\nabla^2 p(x|w_0)) \frac{q(x)}{p(x|w_0)}\,dx.$$

If $Q > 0$, then $\mathrm{tr}(IJ^{-1}) > d$, resulting that the generalization loss of Bayes estimation
is smaller than that by the maximum likelihood method.

Example. For $w \in \mathbb{R}$,

$$p(x|w) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-w)^2}{2}\right).$$

Then

$$L(w) = \frac{1}{2} \int (x-w)^2 q(x)\,dx + \frac{1}{2} \log(2\pi).$$

Hence $w_0 = E_X[X]$ and

$$L(w_0) = \frac{1}{2} V(X) + \frac{1}{2} \log(2\pi),$$

where $V(X) = E_X[X^2] - E_X[X]^2$. The value $Q$ is

$$Q = V(X) - 1.$$

If $V(X) > 1$, then the generalization error is a decreasing function of $1/\beta$, in other
words, the Bayes estimation makes the generalization loss smaller than that by the
maximum likelihood method. Hence, in a nonparametrizable case, it depends on the
case which estimation makes the generalization loss smaller.
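
The comparison in this example can be made concrete with a small calculation of the leading-order excess loss $E[B_g] - S \approx \mathrm{tr}(IJ^{-1})/(2n) + (d - \mathrm{tr}(IJ^{-1}))/(2n\beta)$. In the normal location model above $d = 1$, $I = V(X)$ and $J = 1$, so $\mathrm{tr}(IJ^{-1}) = V(X)$. The function name and the numerical values of $n$ and $V(X)$ below are hypothetical.

```python
import math

def excess_generalization_loss(n, beta, d, tic):
    """Leading-order E[B_g] - S = tic/(2n) + (d - tic)/(2 n beta).

    beta = math.inf corresponds to the maximum likelihood method.
    """
    ml_term = tic / (2.0 * n)
    if math.isinf(beta):
        return ml_term
    return ml_term + (d - tic) / (2.0 * n * beta)

# Normal location model: d = 1 and tr(I J^{-1}) = V(X); V(X) = 4 is a hypothetical value.
n, d, var_x = 100, 1, 4.0
bayes = excess_generalization_loss(n, 1.0, d, var_x)        # beta = 1
ml = excess_generalization_loss(n, math.inf, d, var_x)      # beta = infinity
```

With $V(X) > 1$ the Bayes value is the smaller of the two, and with $V(X) < 1$ the order is reversed; at $\beta = 1$ the leading term is $d/(2n)$ whatever the value of $V(X)$.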

7 Conclusion 

In this paper, we theoretically proved that equations of states in statistical estimation 
hold even if a true distribution is nonparametrizable and regular for a learning machine. 
In the previous papers, we proved that the equations of states hold even if a true
distribution is parametrizable and singular. By combining these results, the equations
of states are applicable without regard to the condition of the true distribution and
the learning machine. Moreover, the equations of states contain AIC and TIC as
special cases.
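8 Appendix

8.1 Proof of Lemma 1

By using eq.(30), $\nabla L_n(\hat{w}) = 0$, in a neighborhood of $\hat{w}$,

$$L_n(w) = L_n(\hat{w}) + \frac{1}{2}(w - \hat{w}) \cdot K_n(\hat{w})(w - \hat{w}) + r(w),$$

where $r(w)$ is given by

$$r(w) = \frac{1}{6} \sum_{i,j,k} \frac{\partial^3 L_n(\hat{w})}{\partial w_i \partial w_j \partial w_k} (w - \hat{w})_i (w - \hat{w})_j (w - \hat{w})_k + O(|w - \hat{w}|^4).$$

Hence, for a given function $g(w)$, the average by the a posteriori distribution is given
by

$$E_w[g(w)] = \frac{\int g(w) \exp\left(-\frac{n\beta}{2}(w - \hat{w}) \cdot K_n(\hat{w})(w - \hat{w}) - n\beta r(w)\right)dw}{\int \exp\left(-\frac{n\beta}{2}(w - \hat{w}) \cdot K_n(\hat{w})(w - \hat{w}) - n\beta r(w)\right)dw}.$$

The main region of the integration is a neighborhood of $\hat{w}$, $|w - \hat{w}| < \epsilon$, hence by
putting $w' = \sqrt{n}(w - \hat{w})$,

$$E_w[g(w)] = \frac{\int g\left(\hat{w} + \frac{w'}{\sqrt{n}}\right) \exp\left(-\frac{\beta}{2} w' \cdot K_n(\hat{w}) w' - \frac{\beta \delta(w')}{\sqrt{n}} + O_p\left(\frac{1}{n}\right)\right)dw'}{\int \exp\left(-\frac{\beta}{2} w' \cdot K_n(\hat{w}) w' - \frac{\beta \delta(w')}{\sqrt{n}} + O_p\left(\frac{1}{n}\right)\right)dw'},$$

where $\delta(w')$ is the third-order polynomial,

$$\delta(w') = \frac{1}{6} \sum_{i,j,k} \frac{\partial^3 L_n(\hat{w})}{\partial w_i \partial w_j \partial w_k} w'_i w'_j w'_k.$$

By using

$$\exp\left(-\frac{\beta \delta(w')}{\sqrt{n}} + O_p\left(\frac{1}{n}\right)\right) = 1 - \frac{\beta \delta(w')}{\sqrt{n}} + O_p\left(\frac{1}{n}\right),$$

it follows that

$$E_w[g(w)] = \frac{\int g\left(\hat{w} + \frac{w'}{\sqrt{n}}\right) \left(1 - \frac{\beta \delta(w')}{\sqrt{n}} + O_p\left(\frac{1}{n}\right)\right) \exp\left(-\frac{\beta}{2} w' \cdot K_n(\hat{w}) w'\right)dw'}{\int \exp\left(-\frac{\beta}{2} w' \cdot K_n(\hat{w}) w'\right)dw'}.$$

Hence by putting $g(w) = w - \hat{w}$, we obtain eq.(36) and by putting $g(w) = (w - \hat{w})(w - \hat{w})^T$,
eq.(37). In the same way, eq.(38) and eq.(39) are proved. Let us prove eq.(40).
By using eq.(37),

$$\begin{aligned}
E_w[(w - w_0)(w - w_0)^T] &= E_w\left[\left(\hat{w} - w_0 + \frac{w'}{\sqrt{n}}\right)\left(\hat{w} - w_0 + \frac{w'}{\sqrt{n}}\right)^T\right] \\
&= (\hat{w} - w_0)(\hat{w} - w_0)^T + \frac{K_n^{-1}(\hat{w})}{n\beta} + O_p\left(\frac{1}{n^{3/2}}\right).
\end{aligned} \tag{45}$$

Then by applying eq.(33), eq.(40) is obtained. Lastly, in general,

$$|w - w_0|^3 \le 4(|w - \hat{w}|^3 + |\hat{w} - w_0|^3).$$

Then, by eq.(34) and eq.(39), eq.(41) is derived. Therefore we have obtained Lemma
1.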

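8.2 Proof of Lemma 2

By the Taylor expansion of $f(X,w)$ at $w_0$,

$$f(X,w) = \nabla f(X,w_0) \cdot (w - w_0) + \frac{1}{2}(w - w_0) \cdot \nabla^2 f(X,w_0)(w - w_0) + f_3(X,w), \tag{46}$$

where $f_3(X,w)$ satisfies

$$|f_3(X,w)| \le C(X,w) |w - w_0|^3$$

in a neighborhood of $w_0$ with a function $C(X,w) \ge 0$. Let us estimate $D_1, \ldots, D_6$.
Firstly, by using eq.(29) and eq.(41),

$$\begin{aligned}
D_1 &= \frac{1}{2} E E_w E_X[(w - w_0) \cdot \nabla^2 f(X,w_0)(w - w_0)] + o\left(\frac{1}{n}\right) \\
&= \frac{1}{2} E E_w[(w - w_0) \cdot J(w - w_0)] + o\left(\frac{1}{n}\right).
\end{aligned}$$

Then by using the identity

$$(\forall u, v \in \mathbb{R}^d) \quad u \cdot v = \mathrm{tr}(u v^T),$$

and eq.(40),

$$\begin{aligned}
D_1 &= \frac{1}{2} E E_w[\mathrm{tr}((J(w - w_0))(w - w_0)^T)] + o\left(\frac{1}{n}\right) \\
&= \frac{d}{2n\beta} + \frac{\mathrm{tr}(IJ^{-1})}{2n} + o\left(\frac{1}{n}\right).
\end{aligned}$$
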
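Secondly, by using the identity

$$(\forall u, v \in \mathbb{R}^d) \quad (u \cdot v)^2 = \mathrm{tr}((u u^T)(v v^T)),$$

the definition of $I$, and eq.(40),

$$\begin{aligned}
D_2 &= (1/2) E E_w[E_X[(\nabla f(X,w_0) \cdot (w - w_0))^2]] + o\left(\frac{1}{n}\right) \\
&= (1/2) E E_w[\mathrm{tr}(I(w - w_0)(w - w_0)^T)] + o\left(\frac{1}{n}\right) \\
&= \frac{\mathrm{tr}(IJ^{-1})}{2n\beta} + \frac{\mathrm{tr}(IJ^{-1}IJ^{-1})}{2n} + o\left(\frac{1}{n}\right).
\end{aligned}$$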

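Thirdly, by the definition of $I$, eq.(36), and eq.(33),

$$\begin{aligned}
D_3 &= (1/2) E E_X[E_w[\nabla f(X,w_0) \cdot (w - w_0)]^2] + o\left(\frac{1}{n}\right) \\
&= (1/2) E E_X[(\nabla f(X,w_0) \cdot (\hat{w} - w_0))^2] + o\left(\frac{1}{n}\right) \\
&= (1/2) E[\mathrm{tr}(I(\hat{w} - w_0)(\hat{w} - w_0)^T)] + o\left(\frac{1}{n}\right) \\
&= \frac{\mathrm{tr}(IJ^{-1}IJ^{-1})}{2n} + o\left(\frac{1}{n}\right).
\end{aligned}$$

Fourthly, by the Taylor expansion eq.(46),

$$\begin{aligned}
D_4 &= E E_w[E_j^{(n)}[\nabla f(X_j,w_0) \cdot (w - w_0)]] + \frac{1}{2} E E_w[E_j^{(n)}[(w - w_0) \cdot \nabla^2 f(X_j,w_0)(w - w_0)]] + o\left(\frac{1}{n}\right) \\
&= E[E_j^{(n)}[\nabla f(X_j,w_0)] \cdot E_w[w - w_0]] + \frac{1}{2} E E_w[(w - w_0) \cdot J_n(w_0)(w - w_0)] + o\left(\frac{1}{n}\right).
\end{aligned}$$

Then by using $E[J_n(w_0)] = J$, the second term is equal to $D_1$. To the first term, we
apply eq.(36) and

$$E_j^{(n)}[\nabla f(X_j,w_0)] = \nabla L_n(w_0) + O_p\left(\frac{1}{n}\right),$$

and we obtain

$$D_4 = E[(\nabla L_n(w_0)) \cdot (\hat{w} - w_0)] + D_1 + o\left(\frac{1}{n}\right).$$

Then applying eq.(31), $K_n(w_0) \to J$, and $w^*_\beta \to w_0$,

$$\begin{aligned}
D_4 &= -E[(K_n(w^*_\beta)(\hat{w} - w_0)) \cdot (\hat{w} - w_0)] + D_1 + o\left(\frac{1}{n}\right) \\
&= \frac{d}{2n\beta} - \frac{\mathrm{tr}(IJ^{-1})}{2n} + o\left(\frac{1}{n}\right).
\end{aligned}$$

And lastly, by the definitions,

$$D_5 = (1/2) E E_w[E_j^{(n)}[(\nabla f(X_j,w_0) \cdot (w - w_0))^2]] + o\left(\frac{1}{n}\right),$$

$$D_6 = (1/2) E E_j^{(n)}[E_w[\nabla f(X_j,w_0) \cdot (w - w_0)]^2] + o\left(\frac{1}{n}\right).$$

By using the convergences in probability, $I_n(w_0) \to I$ and $J_n(w_0) \to J$, it follows that

$$D_5 = D_2 + o\left(\frac{1}{n}\right), \qquad D_6 = D_3 + o\left(\frac{1}{n}\right),$$

which completes Lemma 2.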